-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Backtransform data before mapping statistics #4194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backtransform data before mapping statistics #4194
Conversation
Something makes me uncomfortable about this. At a minimum, I think this needs much more documentation to really explain the logic at each step. |
will this not potentially break every plot that has ever used I must admit that I think it is better to document what has happened to the data up until the |
@thomasp85 Something like this PR may very well be needed. I've dealt with a similar problem in coords. However, I think we should do some more due diligence before merging. It'd be a breaking change, but I think the cases where this would hit would be quite rare, since it requires a combination of using I'm providing another reprex, using the current ggplot2 (without this patch), that shows that the current behavior is confusing and inconsistent with how everything else in ggplot2 works.
Created on 2020-09-06 by the reprex package (v0.3.0) |
I'm not convinced, but I'm also not 100% against. Scale transformations are applied as the very first thing in the data pipeline which is something there is value in teaching and understanding. It seems weird to me to cherry-pick parts of the data-pipeline and undue them at certain points... Again, not saying this shouldn't be done, but I'd be weary of doing this since, as you note yourself, this is an extreme edge case and the change required is not nothing... |
Another example. This one I find even more disconcerting, because it produces a plot that is objectively wrong.
Created on 2020-09-06 by the reprex package (v0.3.0) |
Well, it is only objectively wrong because we strip |
A possible compromise could be to leave |
I'd much rather explore this (or make it an argument to |
I too feel unhappy about this PR, and half of the purposes of this is to share the uncomfortableness with you so that you'll come up with some superseding solution, as you always did :)
Yeah, actually it made me think back-transformation is needed here as well.
Probably we can easily skip the back-transformation by checking if the trans is identity? Besides, the back-transformation only happens when there's some Stat (otherwise it returns early), so I don't think it makes up 99.9%. Whether or not we end up adopting this, I think this is worth breaking change as the name |
I'm not suggesting changing these names again, but I did realize today that I've been always confused by |
Sorry - I confused myself as to when this function would get called.
I'm sure |
@clauswilke what kind of related issues have you face in coords? |
For geoms such Lines 46 to 48 in ac2b5a7
Getting this all right was quite tricky, because in many places the code wasn't very clear about whether it needed a regular range or a backtransformed range. It took quite a while to disentangle all of this. I tried to document this here: Lines 16 to 30 in ac2b5a7
|
I agree with this part. While the word "scale" might have different meanings in different places in ggplot2, probably it still has cleaner meaning than "mapping," which is too general. |
Team - should we revive this for the upcoming release? I don't think my stance has changed all that much, but I'd be happy to consider an argument — I don't think a new function is the correct way because there would be no way to port this over to |
I just reread the entire thread and I'm not sure I have a useful opinion either way. Maybe it would be a good idea to bring in @teunbrand, since he filed the original issue that brought this up. If we want to go forward with this, I think it needs an empirical approach. Implement a version of the suggested idea and see if it breaks things or has performance implications. |
I do have a preference that the |
R/layer.r
Outdated
@@ -299,6 +299,9 @@ Layer <- ggproto("Layer", NULL, | |||
# evaluation (since the evaluation symbols gets renamed) | |||
data <- rename_aes(data) | |||
|
|||
# data needs to be non-scaled | |||
data_orig <- scales_backtransform_df(plot$scales, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might be able to alleviate Thomas' concern that this is applied to 99% of plots, by executing this line after the early exit at the if (length(new) == 0) return(data)
line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, thanks!
I think we might even further reduce the impact of this PR on performance if we selectively backtransform the relevant columns, and leave other columns as-is. # Suppose these are the aesthetics to evaluate
new <- aes(x = after_stat(y + 1), colour = 2, fill = Species)
# Extract all variables occuring in the aesthetics
vars <- unlist(lapply(new, all.vars), use.names = FALSE)
# Only backtransform variables that occur
data_orig <- scales_backtransform_df(scales, df[intersect(names(df), vars)]) Together with Hiroaki's suggestion earlier:
I think the performance hit might be contained to a minimum. |
Thanks!
I think this works in most cases, there's chance that some variable is referenced indirectly (e.g. |
If the concern is the performance, I think I addressed it. |
@yutannihilation is this ready for review? |
Yes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a news bullet indicating that this is a breaking change (though quite niche)
otherwise it's good to merge
Thanks! Let me find some good explanation for NEWS. |
Added a test and a NEWS item. |
Fix #4155
On #4155 (comment), I wrote
but, I was wrong. The transformation that the data needs to be bypassed is not the one after mapping the calculated variables, but the one before mapping. Otherwise, if we do another calculation over the calculated value, the result is still wrong even if it skip the transformation at the end of
Layer$map_statistic()
.Here's an example. When evaluating
after_stat
, the actual scaled value ofx
is4
, and the result ofx / 2
will be2
, which will locate at4
in sqrt scale. But, in actual,x / 2
is8
.The transformation we want to bypass is this one. But, the operations that come after this (i.e. training and mapping positions) depends on the scaled values, so we cannot skip this transformation.
ggplot2/R/plot-build.r
Lines 60 to 61 in 6b8dba0
After some thinking, I come to a conclusion that the values need to be back-transformed before evaluating
after_stat
. There might be cleaner mechanism for this, but I believe this is a realistic solution at the moment...Created on 2022-05-15 by the reprex package (v2.0.1)